user input
- Asia > China > Hong Kong (0.04)
- North America > Dominican Republic (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- (2 more...)
Optimizing Prompts for Text-to-Image Generation
Well-designed prompts can guide text-to-image models to generate amazing images. However, the performant prompts are often model-specific and misaligned with user input. Instead of laborious human engineering, we propose prompt adaptation, a general framework that automatically adapts original user input to model-preferred prompts. Specifically, we first perform supervised fine-tuning with a pretrained language model on a small collection of manually engineered prompts. Then we use reinforcement learning to explore better prompts. We define a reward function that encourages the policy to generate more aesthetically pleasing images while preserving the original user intentions. Experimental results on Stable Diffusion show that our method outperforms manual prompt engineering in terms of both automatic metrics and human preference ratings. Moreover, reinforcement learning further boosts performance, especially on out-of-domain prompts.
GaussianCut: Interactive segmentation via graph cut for 3D Gaussian Splatting
We introduce GaussianCut, a new method for interactive multiview segmentation of scenes represented as 3D Gaussians. Our approach allows for selecting the objects to be segmented by interacting with a single view. It accepts intuitive user input, such as point clicks, coarse scribbles, or text. Using 3D Gaussian Splatting (3DGS) as the underlying scene representation simplifies the extraction of objects of interest which are considered to be a subset of the scene's Gaussians. Our key idea is to represent the scene as a graph and use the graph-cut algorithm to minimize an energy function to effectively partition the Gaussians into foreground and background. To achieve this, we construct a graph based on scene Gaussians and devise a segmentation-aligned energy function on the graph to combine user inputs with scene properties. To obtain an initial coarse segmentation, we leverage 2D image/video segmentation models and further refine these coarse estimates using our graph construction. Our empirical evaluations show the adaptability of GaussianCut across a diverse set of scenes.
Capability-Driven Skill Generation with LLMs: A RAG-Based Approach for Reusing Existing Libraries and Interfaces
da Silva, Luis Miguel Vieira, Köcher, Aljosha, König, Nicolas, Gehlhoff, Felix, Fay, Alexander
Modern automation systems increasingly rely on modular architectures, with capabilities and skills as one solution approach. Capabilities define the functions of resources in a machine-readable form and skills provide the concrete implementations that realize those capabilities. However, the development of a skill implementation conforming to a corresponding capability remains a time-consuming and challenging task. In this paper, we present a method that treats capabilities as contracts for skill implementations and leverages large language models to generate executable code based on natural language user input. A key feature of our approach is the integration of existing software libraries and interface technologies, enabling the generation of skill implementations across different target languages. We introduce a framework that allows users to incorporate their own libraries and resource interfaces into the code generation process through a retrieval-augmented generation architecture. The proposed method is evaluated using an autonomous mobile robot controlled via Python and ROS 2, demonstrating the feasibility and flexibility of the approach.
- Research Report (1.00)
- Workflow (0.95)
QBR: A Question-Bank-Based Approach to Fine-Grained Legal Knowledge Retrieval for the General Public
Yuan, Mingruo, Kao, Ben, Wu, Tien-Hsuan
Retrieval of legal knowledge by the general public is a challenging problem due to the technicality of the professional knowledge and the lack of fundamental understanding by laypersons on the subject. Traditional information retrieval techniques assume that users are capable of formulating succinct and precise queries for effective document retrieval. In practice, however, the wide gap between the highly technical contents and untrained users makes legal knowledge retrieval very difficult. We propose a methodology, called QBR, which employs a Questions Bank (QB) as an effective medium for bridging the knowledge gap. We show how the QB is used to derive training samples to enhance the embedding of knowledge units within documents, which leads to effective fine-grained knowledge retrieval. We discuss and evaluate through experiments various advantages of QBR over traditional methods. These include more accurate, efficient, and explainable document retrieval, better comprehension of retrieval results, and highly effective fine-grained knowledge retrieval. We also present some case studies and show that QBR achieves social impact by assisting citizens to resolve everyday legal concerns.
- North America > United States (0.28)
- North America > Canada (0.04)
- Asia > China > Hong Kong (0.04)
- (4 more...)
- Law (1.00)
- Health & Medicine > Therapeutic Area (0.46)
- Government > Regional Government (0.46)
- Asia > China > Liaoning Province > Dalian (0.05)
- Asia > China > Hong Kong (0.05)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
LLM Reinforcement in Context
LLM alignment techniques currently struggle with enforcing desired characteristics and harmlessness of outputs over long conversational contexts and chains-of-thought. In this paper we present the scaling problem, a mathematical formulation of this difficulty, and propose interruptions as a means to achieve LLM alignment in scaling contexts. We call this reinforcement in context. Paper structure is as follows: section 1 is this introduction and section 2 presents the scaling problem. In section 3 we describe interruptions as a means to solve the alignment scaling problem. In section 4 we discuss consequences and limitations and in section 5 we highlight avenues for future research.
Live Music Models
Lyria Team, null, Caillon, Antoine, McWilliams, Brian, Tarakajian, Cassie, Simon, Ian, Manco, Ilaria, Engel, Jesse, Constant, Noah, Li, Yunpeng, Denk, Timo I., Lalama, Alberto, Agostinelli, Andrea, Huang, Cheng-Zhi Anna, Manilow, Ethan, Brower, George, Erdogan, Hakan, Lei, Heidi, Rolnick, Itai, Grishchenko, Ivan, Orsini, Manu, Kastelic, Matej, Zuluaga, Mauricio, Verzetti, Mauro, Dooley, Michael, Skopek, Ondrej, Ferrer, Rafael, Petridis, Savvas, Borsos, Zalán, Oord, Äaron van den, Eck, Douglas, Collins, Eli, Baldridge, Jason, Hume, Tom, Donahue, Chris, Han, Kehang, Roberts, Adam
We introduce a new class of generative models for music called live music models that produce a continuous stream of music in real-time with synchronized user control. We release Magenta RealTime, an open-weights live music model that can be steered using text or audio prompts to control acoustic style. On automatic metrics of music quality, Magenta RealTime outperforms other open-weights music generation models, despite using fewer parameters and offering first-of-its-kind live generation capabilities. We also release Lyria RealTime, an API-based model with extended controls, offering access to our most powerful model with wide prompt coverage. These models demonstrate a new paradigm for AI-assisted music creation that emphasizes human-in-the-loop interaction for live music performance.
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
LGM: Enhancing Large Language Models with Conceptual Meta-Relations and Iterative Retrieval
Lei, Wenchang, Zou, Ping, Wang, Yue, Sun, Feng, Zhao, Lei
Large language models (LLMs) exhibit strong semantic understanding, yet struggle when user instructions involve ambiguous or conceptually misaligned terms. We propose the Language Graph Model (LGM) to enhance conceptual clarity by extracting meta-relations-inheritance, alias, and composition-from natural language. The model further employs a reflection mechanism to validate these meta-relations. Leveraging a Concept Iterative Retrieval Algorithm, these relations and related descriptions are dynamically supplied to the LLM, improving its ability to interpret concepts and generate accurate responses. Unlike conventional Retrieval-Augmented Generation (RAG) approaches that rely on extended context windows, our method enables large language models to process texts of any length without the need for truncation. Experiments on standard benchmarks demonstrate that the LGM consistently outperforms existing RAG baselines.
- North America > United States (0.14)
- Europe > France (0.04)
- Asia > China > Hunan Province (0.04)
- (5 more...)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Consumer Health (1.00)
- Education > Health & Safety > School Nutrition (1.00)
- (2 more...)